Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking
نویسندگان
چکیده
This paper deals with the problem of supervised wrapper-based feature subset selection in datasets with a very large number of attributes. Recently the literature has contained numerous references to the use of hybrid selection algorithms: based on a filter ranking, they perform an incremental wrapper selection over that ranking. Though working fine, these methods still have their problems: (1) depending on the complexity of the wrapper search method, the number of wrapper evaluations can still be too large; and (2) they rely on a univariate ranking that does not take into account interaction between the variables already included in the selected subset and the remaining ones. Here we propose a new approach whose main goal is to drastically reduce the number of wrapper evaluations while maintaining good performance (e.g. accuracy and size of the obtained subset). To do this we propose an algorithm that iteratively alternates between filter ranking construction and wrapper feature subset selection (FSS). Thus, the FSS only uses the first block of ranked attributes and the ranking method uses the current selected subset in order to build a new ranking where this knowledge is considered. The algorithm terminates when no new attribute is selected in the last call to the FSS algorithm. The main advantage of this approach is that only a few blocks of variables are analyzed, and so the number of wrapper evaluations decreases drastically. The proposed method is tested over eleven high-dimensional datasets (2400-46000 variables) using different classifiers. The results show an impressive reduction in the number of wrapper evaluations without degrading the quality of the obtained subset. Preprint submitted to Knowledge-Based Systems November 3, 2010
منابع مشابه
Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets
Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...
متن کاملDeveloping a Filter-Wrapper Feature Selection Method and its Application in Dimension Reduction of Gen Expression
Nowadays, increasing the volume of data and the number of attributes in the dataset has reduced the accuracy of the learning algorithm and the computational complexity. A dimensionality reduction method is a feature selection method, which is done through filtering and wrapping. The wrapper methods are more accurate than filter ones but perform faster and have a less computational burden. With ...
متن کاملA hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts
High dimensional microarray datasets are difficult to classify since they have many features with small number ofinstances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improvethe classification performance of microarray datasets by selecting the significant features. Combining the concepts ofrough sets, weighted rough set, fuzzy rough se...
متن کاملFeature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine
Different approaches have been proposed for feature selection to obtain suitable features subset among all features. These methods search feature space for feature subsets which satisfies some criteria or optimizes several objective functions. The objective functions are divided into two main groups: filter and wrapper methods. In filter methods, features subsets are selected due to some measu...
متن کاملSFLA Based Gene Selection Approach for Improving Cancer Classification Accuracy
In this paper, we propose a new gene selection algorithm based on Shuffled Frog Leaping Algorithm that is called SFLA-FS. The proposed algorithm is used for improving cancer classification accuracy. Most of the biological datasets such as cancer datasets have a large number of genes and few samples. However, most of these genes are not usable in some tasks for example in cancer classification....
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Knowl.-Based Syst.
دوره 25 شماره
صفحات -
تاریخ انتشار 2012